Add resplitting functionality to Flower Datasets #2427
Apply suggestions Co-authored-by: Daniel J. Beutel <[email protected]>
Maybe the |
Also, one more thing. I don't think the |
Very cool functionality. I just left a single comment.
This felt fine when I was testing the splitter functionality.
Co-authored-by: Daniel J. Beutel <[email protected]>
I've switched the keys and values. I also fixed resplit_dataset_if_needed: I had violated the single responsibility principle there, since it combined complex initialization with the resplit functionality. The init is now moved to utils. That also created a circular import of Resplitter, so I created common.typing. Its name can be changed, but typing alone didn't work, so if common.typing is not OK, then maybe types.py?
Issue
The datasets downloaded from Hugging Face come with certain splits. Yet, users might want to use different divisions of the whole dataset, which is currently not possible.
Description
But what if the dataset has three splits, "train", "valid", and "test" (not two, like "mnist")?
In that case, you might want to merge splits into a single dataset from which 10 partitions are created.
The reverse might hold true for a single-split dataset. Sometimes a dataset has just a "train" split, and you need to create a centralized dataset from part of it. This is currently impossible.
Proposal
Enable the missing functionality described above (and add tests to make sure the requirements are met).
`resplitter` keyword in the `FederatedDataset`, which accepts either of the following:
- `Callable[[DatasetDict], DatasetDict]`. You might perform as sophisticated a change as you wish; just provide this as `resplitter` to the `FederatedDataset`. (Note: all the checks to use that correctly are on the user side.)
- `Dict[Tuple[str, ...], str]` (called `ResplitStrategy` for convenience). Here is an example: `{("train", "valid"): "bigger_train"}`. We create a "bigger_train" split from the "train" and "valid" splits. From this object a `Resplitter` (newly introduced class) is created. That is essentially a `Callable[[DatasetDict], DatasetDict]` with an additional check that the splits are used correctly (you can use only the existing splits, and you cannot create a new dataset that has two splits with the same name).

This works as follows:
First option
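As a rough illustration of the first option, a user-written callable could look like the sketch below. To keep it runnable without any downloads, plain dictionaries of row lists stand in for `DatasetDict` (an assumption made for illustration only), and `merge_train_and_valid` is a hypothetical name. With the real Hugging Face library, the merge step would presumably use `datasets.concatenate_datasets` instead of list concatenation.

```python
from typing import Dict, List

# Stand-in for datasets.DatasetDict (assumption): split name -> list of rows.
SplitDict = Dict[str, List[dict]]


def merge_train_and_valid(splits: SplitDict) -> SplitDict:
    """Hypothetical user-defined resplitter: merge "train" and "valid".

    No validation happens here; as the PR notes, all checks for a custom
    callable are on the user side.
    """
    merged = splits["train"] + splits["valid"]
    # Keep "test" untouched and expose the merged rows under a new name.
    return {"bigger_train": merged, "test": splits["test"]}


example = {
    "train": [{"x": 1}],
    "valid": [{"x": 2}],
    "test": [{"x": 3}],
}
resplit = merge_train_and_valid(example)
```

A function of this shape is what would be passed as `resplitter` to the `FederatedDataset`.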
The second option (that does the same thing): `ResplitStrategy` specification => internally `Resplitter` creation
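The conversion from a strategy dictionary to a checked callable might be sketched as follows. This is not the PR's actual `Resplitter` class: `make_resplitter` is a hypothetical helper, plain dictionaries of row lists again stand in for `DatasetDict`, and dropping splits that the strategy does not mention is an assumption of this sketch rather than documented behavior.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in for datasets.DatasetDict (assumption): split name -> list of rows.
SplitDict = Dict[str, List[dict]]
ResplitStrategy = Dict[Tuple[str, ...], str]


def make_resplitter(
    strategy: ResplitStrategy,
) -> Callable[[SplitDict], SplitDict]:
    """Hypothetical helper: build a checked resplitter from a strategy."""
    # Check 1: two entries must not produce splits with the same name.
    new_names = list(strategy.values())
    if len(new_names) != len(set(new_names)):
        raise ValueError("The strategy creates duplicate split names.")

    def resplit(splits: SplitDict) -> SplitDict:
        result: SplitDict = {}
        for old_names, new_name in strategy.items():
            # Check 2: only splits that exist in the dataset may be used.
            missing = [name for name in old_names if name not in splits]
            if missing:
                raise ValueError(f"Unknown splits: {missing}")
            merged: List[dict] = []
            for name in old_names:
                merged.extend(splits[name])
            result[new_name] = merged
        # Splits not mentioned in the strategy are dropped (sketch assumption).
        return result

    return resplit


resplitter = make_resplitter(
    {("train", "valid"): "bigger_train", ("test",): "test"}
)
```

Performing the duplicate-name check at construction time means a bad strategy fails before any data is touched, while the existing-split check has to wait until the dataset's splits are known.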